Complex trait methylation scores in the prediction of major depressive disorder

Summary Background DNA methylation (DNAm) is associated with time-varying environmental factors that contribute to major depressive disorder (MDD) risk. We sought to test whether DNAm signatures of lifestyle and biochemical factors were associated with MDD to reveal dynamic biomarkers of MDD risk that may be amenable to lifestyle interventions. Methods Here, we calculated methylation scores (MS) at multiple p-value thresholds for lifestyle (BMI, smoking, alcohol consumption, and educational attainment) and biochemical (high-density lipoprotein (HDL) and total cholesterol) factors in Generation Scotland (GS) (N=9,502) and in a replication cohort (ALSPACadults, N=565), using CpG sites reported in previous well-powered methylome-wide association studies. We also compared their predictive accuracy for MDD to a MDD MS in an independent GS sub-sample (N=4,432). Findings Each trait MS was significantly associated with its corresponding phenotype in GS (βrange=0.089–1.457) and in ALSPAC (βrange=0.078–2.533). Each MS was also significantly associated with MDD before and after adjustment for its corresponding phenotype in GS (βrange=0.053–0.145). After accounting for relevant lifestyle factors, MS for educational attainment (β=0.094) and alcohol consumption (MSp-value<0.01–0.5; βrange=-0.069–0.083) remained significantly associated with MDD in GS. Smoking (AUC=0.569) and educational attainment (AUC=0.585) MSs could discriminate MDD from controls better than the MDD MS (AUC=0.553) in the independent GS sub-sample. Analyses implicating MDD did not replicate across ALSPAC, although the direction of effect was consistent for all traits when adjusting for the MS corresponding phenotypes. Interpretation We showed that lifestyle and biochemical MS were associated with MDD before and after adjustment for their corresponding phenotypes (pnominal<0.05), but not when smoking, alcohol consumption, and BMI were also included as covariates. MDD results did not replicate in the smaller, female-only independent ALSPAC cohort (NALSPAC=565; NGS=9,502), potentially due to demographic differences or low statistical power, but effect sizes were consistent with the direction reported in GS. DNAm scores for modifiable MDD risk factors may contribute to disease vulnerability and, in some cases, explain additional variance to their observed phenotypes. Funding Wellcome Trust.


Evidence before this study
Major depressive disorder (MDD) is a prevalent psychiatric disorder that is known to result from a complex combination of genetic and environmental risk factors. Polygenic risk scores only account for approximately 1.5-3.2% of the variance in MDD, and previous evidence has also shown associations with a number of lifestyle risk factors, including alcohol consumption, smoking, and body mass index. These factors are also known to have widespread effects on the methylome. In addition, differential DNA methylation has recently been associated with MDD, although the variance explained in the disorder remains small.
Although there is evidence of DNA methylation links to both MDD and environmental factors, the epigenetic signature of these factors in relation to MDD has not been investigated thus far. To assess the existing evidence for epigenetic signatures of environmental risk factors for MDD and their association with MDD, we searched Google Scholar for studies from inception to 2021, using the following search terms: "MDD OR DNA methylation environmental risk factors", "MDD OR lifestyle factors DNA methylation", "MDD OR MDD environmental risk factors", "epigenome-wide association studies of smoking OR alcohol OR BMI OR MDD", "Methylation risk scores AND smoking AND BMI AND alcohol AND MDD". We also examined reference lists and citations of relevant publications. We did not find any studies that looked specifically at the epigenetic signatures of lifestyle and environmental risk factors for MDD and associations with MDD. We therefore sought to investigate these associations in the current study.

Added value of this study
To our knowledge, this is one of a few studies to look at epigenetic signatures of lifestyle and biochemical factors that confer risk to MDD and their association with MDD in two large, population-based cohorts. Using previous large-scale epigenome-wide association studies, we report associations between 6 complex traits and their epigenetic signature in both cohorts investigated here. In our main cohort, we further report associations between the epigenetic signature of the 6 complex traits and MDD, although these associations become non-significant when accounting for further lifestyle variables. Our MDD results were not replicated in the second cohort. Our findings here indicate that lifestyle factors attenuate the relationship between the epigenetic signature of MDD-relevant environmental risk factors and MDD. The study highlights the importance of lifestyle factors in MDD-DNA methylation associations.

Implications of all the available evidence
Our findings suggest that, although there are associations between MDD and a number of environmental variables, the association between their epigenetic signature and MDD is attenuated when considering a number of lifestyle factors. Investigating the epigenetics of disease-relevant modifiable factors may uncover useful biomarkers for disease stratification as well as treatment options that may be responsive to lifestyle modifications. However, the relationship between DNA methylation and MDD is incompletely understood, and future studies, both cross-sectional and longitudinal, will be able to shed light on the trajectory of DNA methylation in relation to both lifestyle factors and MDD.

Introduction
Major depressive disorder (MDD) is a prevalent psychiatric disorder and is a leading cause of disability worldwide. 1 MDD is moderately heritable (h 2 =37%) and is known to result from a complex combination of genetic and environmental risk factors. 1 Polygenic risk scores (PRS) derived from large-scale genome-wide association studies (GWAS) explain approximately 1.5À3.2% of MDD risk in independent cohorts. 2 In addition, a number of modifiable lifestyle factors are known to associate with MDD, including alcohol intake, smoking, sleeping pattern, diet, and body mass index (BMI). 3,4 Recently, methylome-wide association studies (MWAS) have begun to identify depressive symptom associations with differential DNA methylation (DNAm) at cytosine-phosphate-guanine (CpG) sites annotated to genes implicated in disorder-and neuralrelated traits. 5,6 Further, methylation scores (MS) for MDD explain additional variance in the disorder when modelled alongside PRS and risk-associated environmental factors, such as smoking, alcohol consumption, and BMI. 7,8 However, a large proportion of variance in MDD remains unexplained after accounting for MDD genetic and methylation risk alongside environmental factors.
Recent studies using both methylome-wide association and penalised regression methods have identified DNAm markers for modifiable lifestyle factors, that are measured peripherally in whole blood and can be used for MS estimation. 9À13 There are now well-established MWAS for a number of lifestyle factors that are relevant to MDD, including smoking, 13 BMI, 10 and alcohol consumption. 11 In addition, using penalised regression, McCartney et al. showed that DNAm predictors for complex traits, including BMI, smoking, educational attainment, and total and HDL cholesterol increased the variance explained in these traits when modelled alongside PRS. 14 This finding is of interest in the application to multifactorial diseases, where modelling PRS alongside MS for relevant risk factors may enhance prediction. For instance, a recent study showed that a risk model combining lung cancer PRS, a smoking-associated MS, and environmental factors such as pack years predicted lung cancer with a higher accuracy than models including individual scores (AUC PRS =0.571, AUC MS =0.628, AUC joint =0.654), with the increase in AUC being mostly attributable to the MS. 15 The study indicates that calculating MS for disease-relevant environmental factors may uncover biomarkers for disease stratification and treatment 15 that may be responsive to lifestyle modifications.
Although several environmental factors with widespread effects on the methylome are known to be associated with MDD, the associations between their epigenetic signatures and MDD has not yet been investigated. Risk prediction models including methylation scores for dynamically changing MDD-associated environmental variables have the potential to increase prediction accuracy for the disorder by capturing an archive of longitudinal exposure. In addition, DNAm biomarkers based on environmental risk factors may lead to the development of novel techniques to measure the efficacy of lifestyle interventions more rapidly, providing potentially useful feedback to both clinicians and patients.
Here, we selected four lifestyle factors (smoking status, alcohol consumption, BMI, educational attainment) and 2 biochemical variables (total cholesterol, high-density lipoprotein (HDL) cholesterol) in N=9,502 individuals in Generation Scotland: the Scottish Family Health Study (GS) that are phenotypically associated with MDD. 3,4 We then conducted a literature search to identify well-powered MWAS of these traits.
The aim of the current study was to compute MSs for these MDD-associated risk factors using methylome-wide significant CpGs. 9À13 For those variables where full summary statistics were available (alcohol consumption, 11 educational attainment, 12 smoking status 13 ), we additionally calculated MS using four additional p-value thresholds, including p < 0.01, 0.05, 0.1, and 0.5 to investigate whether MSs that include a larger number of CpGs would increase prediction. Associations between the MSs and MDD and their corresponding phenotypes were assessed in 9502 individuals in GS. We further split GS into a training (N=5,078) and testing (N=4432) sample to calculate a MDD MS and compared this to complex trait MS in the testing GS sub-sample (N=4432).
We used an age-matched subset of mothers (mean age=47.96 years, N=565) in the Avon Longitudinal Study of Parents and Children (ALSPAC) cohort to replicate findings in GS. To investigate further concordant signals with MDD, we assessed whether single nucleotide polymorphisms (SNPs) associated with methylation at CpG sites comprising each MS (mQTLs) were colocalised with SNPs associated with MDD.

Training panels
To conduct our analyses, we included summary-level data from 6 previous MWASs (educational attainment, HDL cholesterol, total cholesterol, smoking status, alcohol consumption, and BMI 9À13 ). For educational attainment, smoking status, and alcohol consumption, full summary statistics were available and obtained directly from the respective authors. For BMI, HDL and total cholesterol, methylome-wide summary statistics were obtained from the EWAS catalog (http://www.ewascata log.org/), after permission was obtained from the EWAS Catalog team. 16 Further information regarding each MWAS, including cohort selection, statistical analysis, and demographic information, is available in the Supplementary Table 1 and in each study. 9À13

Target panels
Generation Scotland À Scottish family health study (GS). GS is a family-based population cohort aiming to investigate the genetic and environmental causes of health and disease in approximately 24,000 participants aged 18À98 years in Scotland. Baseline data was collected between 2006 and 2011 and includes detailed information on a broad range of variables, including lifestyle and environmental factors, mental health, and medication. 17,18 DNA is also available from blood samples taken at the time of recruitment from more than 20,000 consenting participants.
Avon longitudinal study of parents and children (ALSPAC). ALSPAC is a population-based study in the South-West of England aiming to investigate the effects of multiple factors on health and development. Pregnant women were recruited between April 1991 and December 1992, with the initial number of pregnancies enrolled being 14,541. The cohort now consists of 13,761 mothers, their partners, and their 14,901 children (now young adults). 19À21 The main replication sample in the current study comes from the mothers' follow-up timepoint (mean age=47.96; see Table 2 for further demographic characteristics). 22 Further information regarding the sample is given in the Supplemental Materials.

Ethics
GS received ethical approval from NHS Tayside Research Ethics Committee (REC reference number 05/S1401/89) and has Research Tissue Bank Status (reference: 20/ES/0021). Written informed consent was obtained from all participants.
ALSPAC received ethical approval from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees. Written informed consent was obtained from all participants and consent for biological samples has been collected in accordance with the Human Tissue Act (2004). Please note that the study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool: http://www.bristol.ac.uk/alspac/researchers/our-data/.
Phenotypes GS. MDD status was measured using the axis-I Structured Clinical Interview of the Diagnostic and Statistical Manual, version IV (SCID) and was administered to participants who answered "yes" to either of two screening questions (N=1626, see Supplementary Materials). Control participants were defined as those individuals who answered "no" to the two screening questions (see Supplementary Materials) or did not fulfil criteria for a diagnosis of current or previous MDD following the SCID interview (N=7876). Individuals fulfilling criteria for bipolar disorder or those who self-reported either bipolar disorder or schizophrenia (N=11) were excluded.
Educational attainment was measured by asking participants: "What is the highest educational qualification you have obtained?" with nine available answers, detailed in the Supplementary Materials. BMI was computed using height (cm) and weight (kg) as measured by clinical staff at baseline recruitment. Participants reported their smoking status (never, former, current) as well as the number of units of alcohol consumed during the past week. Finally, concentrations of HDL and total cholesterol in blood were measured at baseline by mmol/L.

ALSPAC. MDD was measured using the Edinburgh
Postnatal Depression Scale (EPDS). 23 Briefly, participants were asked to mark the response closest to how they have been feeling in the past 7 days on a 10-item scale, where the total score is 30 and a score above 13 indicates MDD. 23 We transformed the scores into a binary variable, where MDD cases were those who scored above 13 (N=67) and controls were those with a total score of 13 (N=498).
BMI was computed using height (cm) and weight (kg) as measured by clinical staff at baseline recruitment. Participants reported whether they currently smoke as well as alcohol consumption frequency (see Supplementary Materials). Concentrations of HDL and total cholesterol were measured by mmol/L. Educational attainment was recorded by asking participants: "What is the highest educational qualification you have obtained?" with six available answers, detailed in the Supplementary Materials.

DNA methylation
GS. Genome-wide DNAm data profiled from whole blood samples was available for 9,537 individuals in GS using the Illumina Human-MethylationEPIC Bead-Chip. 24 The DNAm data was initially released in two sets (set 1 N =5,087; set 2 N =4,450). DNAm data was preprocessed and quality checked for all individuals in the present study, including participant removal due to a number of reasons, including sex mismatch (N removed =24), having more than 1% CpG sites with a detection p-value>0.05 (N removed =52), showing evidence of dye bias, being an outlier for bisulphite conversion control probes (N removed =1), having a median methylated signal intensity more than 3 standard deviations lower than expected (N removed =74), and other technical issues (N removed =602). A total of 10,495 CpG sites were removed due to low beadcount, poor detection p-value, and sub-optimal binding. Firstly, R package "minfi" was used to read in the IDAT files, compute M and beta values, and remove probes with large detection p-values, and to compute principal components (PC) of control probes (see Supplementary Tables 2 and 3). Secondly, correction was applied for 1 technical variation, where M values were included as outcome variables in a mixed linear model adjusting for appointment date and Sentrix ID (random effects), jointly with Sentrix position, batch, clinic, year, weekday, and 10 PCs (fixed effects); and 2 biological variation by fitting residuals of 1 as outcome variables in a second mixed linear model adjusting for genetic and common family shared environmental contributions (random effects classed as G: common genetic; K: kinship; F: nuclear family; C: couple; and S: sibling) and sex, age, and estimated cell types proportions (CD8T, CD4T, NK, Bcell, Mono, Gran) (fixed effects). 25 Cross-reactive (N=42,558) and polymorphic (N=10,971) CpGs, obtained from McCartney et al. were removed from the final dataset, resulting in 674,246 CpGs across the 22 autosomes. 26 ALSPAC. The Illumina Infinium HumanMethyla-tion450 Beadchip 27 was used for measuring genomewide DNA methylation data from blood sample for all samples. The R package "mefill" was used for pre-processing and normalisation. 28 Probes were removed based on background detection (p>0.05) and if they reach beyond the 3 times inter-quantile range from 25% and 75%. R function "beta2m" from the "lumi" package 29 was used for M-value transformation. Cross-reactive and polymorphic CpGs (N=34,881), identified by Chen et al. were removed, resulting in 447,975 CpGs across the 22 autosomes remaining for analysis. 30

Statistical methods
All analyses were conducted using R (version 3.6.3) in a Linux environment. The R code for the current analyses is available in the Supplementary RMarkdown File.

MDD-relevant factor selection. To identify risk factors
for MDD for building MSs, we first ran logistic regression models to identify nominal associations between environmental, lifestyle, and biochemical variables available in GS, included as predictor variables in separate models, and MDD, included as an outcome variable. The complete list of investigated variables is included in the Supplementary Table 4. Age and sex were included as covariates in all models.
Following this, we conducted a literature search to identify previous MWASs where DNAm signatures of significantly associated factors were uncovered (see Supplementary Materials). To meet inclusion criteria, studies needed to use the 450K or EPIC array in peripheral blood in an adult population; provide access to methylome-wide findings where full summary statistics were not available; and include smoking status as a covariate where relevant. GS was not included in any of the MWASs. ALSPAC was included in the BMI MWAS as a replication cohort, 10 however, we used weights from the discovery cohort to calculate the BMI MS. We identified and calculated MSs for six factors: total cholesterol, HDL cholesterol, educational attainment, smoking status, alcohol consumption, and BMI. 9À13 MS calculation. All previous MWASs were conducted using the Illumina 450K array. The overlap between CpGs identified in previous MWASs and CpGs available in GS and ALSPAC is presented in Supplementary  Table 5. For each trait, MSs were calculated for all individuals in GS and ALSPAC with available DNAm data (N=9,502 and N=565, respectively) by taking the sum of the product of methylome-wide significant CpGs and their estimated weights in each MWAS. 9À13 Where full summary statistics were available (educational attainment, smoking status, alcohol consumption), we also calculated MS at 4 further p-value thresholds: 0.01, 0.05, 0.1, and 0.5. 11À13 In GS, we calculated a MDD MS to compare to each complex trait MS. As there are no previous well-powered MWAS of MDD to date, we split the GS sample into a training (N=5078) and testing (N=4432) sample by set, as detailed above. We then applied least absolute shrinkage and selection operator (LASSO) penalised regression on 450K array CpG sites measured in the individuals in the training sample. Briefly, depression status was regressed on age, sex and 10 genetic principal components, as in previous studies, 31 and extracted residuals from this model were input as the dependent variable in the regression model. Tenfold cross-validation was applied, and the mixing parameter was set to 1 for our LASSO penalty.
Association of MS with corresponding traits. In both cohorts, the associations between each MS and their corresponding traits were assessed using logistic, linear, and ordinal logistic regression (depending on each trait, included as an outcome variable). Each MS was included as a predictor variable in separate models. Technical and biological variables (GS: age and sex; ALSPAC: age, 20 methylation PCs, and estimated proportions for five white blood cell types (CD4T, CD8T, natural killer cells, B-cells, Granulocytes, estimated using the Houseman method 32 )) were included as covariates in each model. In GS, methylation PCs and cell type estimations were regressed out during pre-processing of the DNAm data and were therefore not included as covariates in downstream analyses. In ALSPAC, we calculated methylation PCs by first residualizing DNAm data on age, sex, and array, and then applying principal component analysis (PCA) on the residualised data. McFadden's R 2 was calculated to determine the proportion of variance in each trait explained by each MS.
Association of MS with MDD. We then tested whether each MS was associated with MDD in GS (N=9502) using logistic regression, where MDD was included as an outcome variable, and each MS was included as a predictor variable. Three statistical models were performed for each MS individually, differing in covariates included. The example below is demonstrated using BMI MS, and these models were repeated for all other MS: Model 1: MDD » age + sex + BMI MS, where the association between each MS and MDD without confounding variables was assessed.
Model 2: MDD » age + sex + BMI + BMI MS, where, for each MS, its corresponding phenotype in GS was included to estimate how much variance each MS explains in MDD when adjusting for its corresponding phenotypic measure.
Model 3: MDD » age + sex + BMI + smoking status + pack years + alcohol consumption + BMI MS, where further lifestyle factors, known to associate with both MDD and DNAm, were included to observe the proportion of variance explained by the MS when adjusting for these factors.
The same models were run in ALSPAC (N=565), with differing technical and biological covariates: age, 20 methylation PCs, and estimated proportions for five white blood cell types (CD4T, CD8T, natural killer cells, B-cells, granulocytes). BMI, alcohol consumption, and smoking status were included in model 3 as lifestyle factors in ALSPAC.
In a subset of GS (N=4432), which was created by splitting the sample into a training and testing sample, we further investigated whether an MDD MS would explain more variance in MDD than complex trait MS. The area under the curve was calculated for each MS and a ROC curve showing the ability of each score to discriminate between MDD cases and controls is shown in Figure 3 below.
Finally, we conducted sensitivity analyses by 1 stratifying the GS sample by sex and running models 2 and 3 in a women-only sample (N=5615, MDD cases=1163), as the ALSPAC sample consisted of women only; and 2 including smoking status as a covariate in model 1 described above, to observe whether this attenuates the relationship between complex trait MS and MDD due to the widespread effect of smoking on the methylome. 33 Colocalization analysis. We hypothesised that some CpG sites included in the complex trait MS will have shared variants with MDD-associated SNPs. We used Howard et al.'s MDD GWAS for MDD-associated SNPs and GoDMC summary statistics (http://www.godmc. org.uk/) for mQTL analysis. 2,34 We used the package "gwasglue" (https://mrcieu.github.io/gwasglue/) to extract SNPs that were +/-1Mb of each of the 102 genome-wide significant SNPs identified in Howard et al. and then extracted the same SNPs within those regions from the GoDMC mQTL analysis. We used the "coloc.abf" function with default parameters in the "coloc" package in R to perform colocalization analysis for each SNP association. 35  For regions of interest with a posterior probability of >0.5, we performed a manual look-up to identify whether any of the loci in these regions colocalize with genetic variation influencing CpG sites that comprise MS for the 6 complex traits investigated here, including smoking, alcohol consumption, educational attainment, BMI, HDL and total cholesterol.

Role of funding sources
Our funding sources were not involved in the study preparation/design, analysis/interpretation of data, or in the writing and submission of this report.

Demographic characteristics
In GS, there were N=9,502 individuals included in the final analyses (Table 1). In ALSPAC, there were N=565 individuals included in the final analyses ( MDD-relevant factor selection. Prior to identifying wellpowered MWASs of potential environmental risk factors, in those individuals with available DNAm data in GS (N=9,502) we ran regression models where age and sex were included as covariates, to measure associations of environmental, lifestyle, and biochemical variables with MDD. All variables investigated, as well as results from regression analyses, are available in Supplementary  Table 4. Table 3 below indicates results for those variables that were nominally associated with MDD in GS and were also identified as having an established DNAm signature in previous well-powered MWAS. Educational attainment was not associated with MDD in GS, however it has been widely investigated in relation to DNAm and was therefore included in subsequent analyses here.
MS and corresponding traits. Each MS, which was included as a predictor, was first investigated in relation to its corresponding phenotype, as outcome, in regression models, along with technical and biological covariates (GS: age and sex; ALSPAC: age, 20 methylation PCs, and 5 cell types (CD4T, CD8T, natural killer cells, B-cells, granulocytes)). All MSs were associated with their phenotypic counterparts in both GS and ALSPAC. MS and MDD. We then examined each MS, as a predictor, with MDD as the outcome, in logistic regression models.    As the replication analyses in ALSPAC consisted of women only, we further stratified the GS sample by sex and ran models 2 and 3 in a women-only sample (N=5615, MDD cases=1163), with results available in Supplementary Table 9. Briefly, analyses restricted to women in GS showed similar results to the sex-adjusted analyses in GS, where MS were associated with MDD after adjustment for their phenotypic counterparts, but not when including further lifestyle factors.
In addition, due to the known effects of smoking on the methylome, 33 we included smoking status as a covariate in model 1 for all non-smoking traits to identify whether this attenuates the relationship between MDD and complex trait MS without adjusting for other covariates. Results are available in Supplementary Table 10. In both GS and ALSPAC, the effect for all complex trait MS was attenuated by the inclusion of smoking. In GS, all complex trait MS remained significant in their association with MDD, except for educational attainment. In ALSPAC, results remained non-significant as below.

MS and MDD À subset analysis. To investigate
whether an MDD MS would out-perform complex trait MSs in the discrimination between MDD cases and controls, we additionally trained an MDD MS in a subset of individuals with DNAm data in GS (N=5078, MDD=1223), where 78 CpGs were selected (Supplementary Table 11). We then calculated an MDD MS in a second subset (N=4432, MDD=408). See Figure 3 for a Receiver Operating Characteristic (ROC) curve showing the ability of each complex trait MS and MDD MS to discriminate between MDD cases and controls. The two complex trait MS that outperformed MDD MS performance (AUC=0.553) are smoking (AUC=0.569) and educational attainment (AUC=0.585). We applied a DeLong test to identify whether this outperformance is Where available, R 2 is calculated for MS at different thresholds (educational attainment, smoking status, alcohol consumption). MW=methylome-wide (Bonferroni-corrected CpGs). * = p-value < 0.05. statistically significant. Educational attainment and smoking MS were both non-significant when compared to a MDD MS in predicting MDD: educational attainment D=1.814, p=0.07; smoking D=0.913, p=0.36, indicating that although the AUC for the two complex traits is higher, the difference is not statistically significant.

Colocalization analysis. Colocalization analysis indi-
cated that there was no strong evidence (PP 4 >0.8, PP 4 / PP 3 >5 36 ) for a single SNP being associated with both MDD and DNAm at CpGs encompassing the MS. The posterior probability for one region was supportive of a suggestive co-localized association signal for both MDD and DNAm in that region (PP 4 =0.71). 37 Within this region, the SNP with the highest posterior probability of being a causal SNP (66%) was rs73163796, which colocalized with genetic variation influencing a smokingassociated CpG site, cg15099418. 13 Supplementary Excel File 1 contains results for all 102 regions investigated in colocalization analysis.

Discussion
We created MSs for 6 environmental and biochemical risk factors for MDD, namely HDL and total cholesterol, BMI, educational attainment, smoking status, and alcohol consumption, in two cohorts, GS (N=9,502) and ALSPAC (N=565). Methylome-wide scores, and where available, scores at multiple p-value thresholds (educational attainment, smoking status, and alcohol consumption), showed significant associations with their corresponding traits and with MDD after adjustment for their phenotypic counterparts. Most findings attenuated and became non-significant after adjustment for further lifestyle factors. Smoking and education MS marginally outperformed a MDD MS in discriminating between MDD cases and controls in a GS sub-sample (N=4,432). Finally, colocalization analysis showed that genetic variants are shared between a smoking associated CpG site (cg15099418) and MDD.
Each MS was significantly associated with its corresponding phenotype in both cohorts (GS b range =0.089À1.457; ALSPAC b range =0.078À2.533). All training MWASs consisted of large sample sizes (N range =725 (HDL and total cholesterol) À 15,907 (smoking status)), included relevant covariates, and results were consistent where replication cohorts were included (see Supplementary Table 1). All of the training MWAS for phenotypes investigated here were sufficiently predictive of the trait in our independent samples (see Tables 4 and 5). The variance explained by MS was 5% for HDL cholesterol and 12%À27.3%      alcohol consumption in Braun et al. 9 and Liu et al., 11 respectively. Other studies that applied penalised regression to derive methylation predictors of environmental factors identified similar proportions of variance explained: 12.5% for BMI and alcohol consumption, 60.9% for smoking, 2.5% for educational attainment, 2.7% for total cholesterol, and 15.6% for HDL cholesterol. 14 These results are consistent with findings here. In contrast, we previously found that a MDD MS explains 1.75% of the variance in MDD, and attenuates when including lifestyle factors (0.68%). 7 This indicates that, although there is evidence of an association between DNAm and MDD, the relationship is not as strong as with lifestyle factors, which is in line with previous evidence. 14 MSs calculated at different p-value thresholds (methylome-wide, 0.01, 0.05, 0.1, 0.5) indicated that the most predictive threshold for each trait was the most conservative one. CpGs meeting a less stringent p-value threshold in the score captured less phenotypic variance for each corresponding trait (R 2 : smoking=24.1% and 19.08% for methylome-wide threshold compared to 13.8% and 7.67% for p < 0.05 threshold in GS and ALSPAC, respectively; education=1.25% and 5.26% for methylomewide threshold compared to 0.78% and 5.21% for p < 0.05 threshold in GS and ALSPAC, respectively; alco-hol=5.9% and 3.26% for methylome-wide threshold compared to 0.8% and 2.20% for p < 0.05 threshold in GS and ALSPAC, respectively). Consistent with this pattern, previous studies suggest that the optimal p-value threshold strongly depends on the epigenetic architecture of the trait, as well as on the strength of supporting data. 38 Lifestyle traits such as smoking, alcohol, and BMI show widespread associations with peripheral blood DNAm, 9À13 and indeed MWAS for lifestyle traits investigated here have large sample sizes as well as the largest number of associated CpGs at a methylome-wide threshold. All CpGs significantly associated with educational attainment 12 were also found to be associated with smoking, 13 which may explain the similar association pattern to lifestyle factors.
In GS, each MS was significantly associated with MDD before corresponding trait and lifestyle factor adjustment, although this was only replicated for methylome-wide educational attainment in ALSPAC. When including smoking status in the covariate-free MS-MDD associations, the effects for all complex trait MSs were attenuated but, with the exception of educational attainment, remained significant (Supplementary Table  10). Although the CpGs (N=11) associated with educational attainment, were adjusted for smoking status in the original MWAS, 12 all 11 were also smoking-associated CpGs, indicating that the two traits share an epigenetic signature and that self-reported smoking status may not be sufficient to correct for smoking signals. 33 In GS, the epigenetic signature of each trait explained additional variance to its phenotypic counterpart in MDD, although effect sizes were small across all traits (methylome-wide brange =0.053À0.145). For traits where MSs were available at multiple p-value thresholds, the variance explained increased for MSs that included CpGs at a larger p-value threshold for smoking (R 2 =0.03% methylome-wide MS to 0.07% p < 0.05) and alcohol consumption (R 2 =0.05% methylome-wide MS to 0.1% p < 0.05) and decreased for educational attainment (R 2 =0.28% methylome-wide MS to 0.20% p < 0.05). This suggests that for lifestyle traits with widespread effects on the methylome, including a larger number of associated CpG sites increased prediction accuracy for MDD, although the effect size did not differ significantly (smoking b methylome-wide MS =0.053 to smoking b p < 0.05 =0.072; alcohol consumption b methylome-wide MS =0.059 to alcohol consumption b p < 0.05 =-0.097; educational attainment methylome-wide MS =-0.145 to educational attainment b p < 0.05 =-0.117).
After adjusting for lifestyle factors that are known to associate with MDD and DNAm (smoking, pack years, BMI, and alcohol consumption), methylome-wide educational attainment and alcohol consumption at 4 pvalue thresholds (p < 0.01-0.5) remained significantly associated with MDD, suggesting that certain lifestyle factors may attenuate the relationship between epigenetic signatures of specific traits and MDD. This is surprising given that smoking is known to have much larger effects on the methylome than alcohol consumption 13 . However, smoking was included as a covariate in all previous MWASs used for the MSs calculated here, while the smoking MWAS 13 did not adjust for any lifestyle factors, suggesting that the smoking MS captures other environment-related CpGs whose effect is attenuated by phenotypic measures of lifestyle factors. This pattern of results is consistent with previous studies where disease-relevant phenotypes attenuate associations between MS for complex traits and disease. For instance, Yu et al. investigated associations between smoking MS and lung cancer before and after adjustment for phenotypic smoking status and pack years. They found that the phenotypes attenuated the association between smoking MS and lung cancer, with odds ratio decreasing across different risk score quartiles. 15 Here, we showed that MS for complex traits enhance MDD risk prediction when added to phenotypic measures of these traits. DNA methylation may represent an archive of exposure to environmental factors that are relevant to MDD and may contribute to disease vulnerability. However, lifestyle factors may play an important role in this relationship. Here, they were shown to attenuate the association between MDD and complex trait MS, indicating that they may interact with widespread DNAm in their association with MDD. This is not surprising, as we have previously found that a MDD MS was significantly associated with smoking and alcohol consumption. 7 Although in our previous study the MDD MS enhanced MDD risk prediction when modelled alongside lifestyle factors, here complex trait MS effects are attenuated in the same scenario. This may show that phenotypic measures of lifestyle factors play a greater role in MDD development than methylation marks for complex traits. However, effect sizes here were small, and further studies will be needed to determine the extent of the role DNA methylation plays in MDD.
While the association between each MS and their corresponding phenotype was replicated in ALSPAC, analyses investigating MDD were not. When the MS was modelled together with its phenotypic counterpart, effects were in the same direction with GS across all traits with the exception of BMI, which was positive in GS but negative in ALSPAC. The opposite pattern was shown in terms of variance explained for MS calculated at multiple thresholds, where explained variance in MDD decreased with a less stringent threshold for smoking (R 2 =0.06% methylome-wide MS to 0% p < 0.05) and alcohol consumption (R 2 =0.32% methylome-wide MS to 0% p < 0.05) and increased for educational attainment (R 2 =0.14% methylome-wide MS to 0.97% p < 0.05). The variance explained in MDD did not exceed 1% for any of the traits in GS. Larger cohorts may, therefore, be required to elucidate the link between MDD and epigenetic signatures of lifestyle factors.
The MWASs used to create MS in the current study uncovered CpGs localised to a number of genes that may be of relevance to MDD. The MWAS of educational attainment identified genes implicated in neuronal, immune, and developmental processes 12 ; alcohol consumption-associated CpGs were localised to genes involved in cellular response to stress and chemicals, and immune functions 11 ; BMI-associated CpGs were linked to genes that played a role in lipid metabolism, inflammation, metabolic, cardiovascular, respiratory, and neoplastic disease 10 ; smoking-related methylation marks were localised to genes implicated in smokingrelated diseases (osteoporosis, colorectal cancers, chronic obstructive pulmonary disease, pulmonary function, cardiovascular disease, rheumatoid arthritis) 13 ; finally, HDL and total cholesterol-associated CpGs were annotated to genes implicated in cholesterol metabolism. 9 Most processes identified in these MWAS have also been previously associated with MDD and antidepressant use, specifically immune and neuronal processes. 2,39 It is therefore possible that some associations between MDD and complex trait MS in this study may arise as a result of the processes in which the genes above participate, although further studies are needed to confirm this.
Differences in results between the two cohorts may be attributable to sample size (N GS =9,502; N AL-SPAC =565) and phenotypic differences. Firstly, the ALSPAC sample consists of women only. However, analyses restricted to women in GS (N=5,615) showed similar results to the sex-adjusted analyses in GS (see Supplementary Table 9), indicating that the lack of replication may be due to factors such as the much smaller sample size in ALSPAC (N=565) rather than sex. Further, although the replication sample was matched in age (GS mean age =49.82, ALSPAC mean age =47.96), there were differences in lifestyle factors between the two cohorts. For instance, 18% of participants in GS smoked at the time of blood draw, as opposed to 8% in ALSPAC; 28% and 24% of individuals held a university degree in GS and ALSPAC, respectively. In addition to this, all participants in ALSPAC had some form of education qualification, whereas 8% of GS participants held no qualifications. Finally, BMI was lower in ALSPAC (mean=24.99) as compared to GS (mean=26.89).
Further, similarly to GS, all MSs in ALSPAC explained a significant proportion of variance in their corresponding phenotypic traits, with non-replicating analyses occurring only when MDD was investigated. MDD was assessed differently in the two cohorts: in GS, this was measured using SCID, while in ALSPAC, MDD status was determined by classifying participants with a score of >13 on the EPDS as cases. Previous studies have shown that EPDS approximated SCID-based prevalence overall, although considerable heterogeneity between cohorts may play a role in this approximation. 40 A MDD MS tested in a GS sub-sample (N=4,432) was outperformed by smoking and education MSs in predicting MDD, although predictive values were low for all MS (MDD AUC =0.553, smoking AUC =0.569, education AUC =0.585). DNAm is highly predictive of smoking, 13 and there is a strong overlap of smokingassociated CpGs in the educational attainment MWAS used to calculate the MS. 12 The two predictors are also highly correlated (r=-0.720). Results therefore suggest that epigenetic signatures of lifestyle traits showing more widespread associations with DNAm are marginally more predictive of MDD than an MDD-specific predictor, although current results are limited by lack of large MWAS of MDD.
There are several key strengths to this study. Firstly, GS is one of the largest population-based cohorts containing DNAm and a broad range of lifestyle, disorder, and environmental variables. Secondly, this study provides insight into associations between MSs for lifestyle and biochemical factors in relation to MDD across multiple p-value thresholds. Finally, we used a second large, population-based study as a replication cohort, which similarly contains a range of lifestyle and environmental variables, in addition to DNAm.
Despite these strengths, a number of potential limitations to the current study also need to be considered. Firstly, although GS uses the EPIC array, capturing DNAm at approximately 850K sites, previous MWAS are limited by use of the 450K array, which measures methylation at 450K CpGs. Using a larger array may improve the predictive accuracy for environmental traits, which in turn may lead to more precise associations in relation to MDD. Secondly, DNAm was collected from blood samples in both cohorts and in all previous MWASs, which may not be the most relevant tissue for MDD. However, previous studies have shown robust associations between peripheral blood-based methylation predictors and MDD. 6,41,42 Participants in GS and ALSPAC are predominantly of European ancestry, and the generalisability to diverse ancestries is unknown. Finally, as mentioned above, ALSPAC contained only women who smoked less and had a lower BMI than GS participants; GS participants reported a higher level of educational attainment than ALSPAC women. Although stratifying the GS cohort by women only indicated that results are not due to sex differences, results here may be due to these phenotypic differences, and future studies should select replication cohorts that are analogous to the training cohort.
In the current study we showed that epigenetic signatures of lifestyle and biochemical factors are associated with MDD after adjustment for their phenotypic counterparts, but not when including a broader number of lifestyle factors. Results were not replicated in a second cohort, which may be due to phenotypic differences compared to the main cohort as well as the much smaller sample size. Lifestyle variables are significant in terms of DNAm-related risk to MDD, and efforts should be made in future to disentangle the relationship between these lifestyle factors, DNAm, and MDD. Our study demonstrates the value and necessity of large DNAm datasets for discovery and replication within and between cohorts.

Contributors
MCB, HCW, and AMM were responsible for the project conceptualisation, methodology and validation. MCB, ASFK, MJA, and AC carried out the data curation. CA, RMW, SWM, and KLE were responsible for DNA methylation data quality check and pre-processing in Generation Scotland and ALSPAC. CL, JvD, and MG were responsible for providing summary statistics for the current study. JLM was responsible for providing results in the form of summary statistics (GoDMC). CR and DJP are project directors for ALSPAC and Generation Scotland, respectively. MCB, HCW, and AMM were responsible for the decision to submit. MCB was responsible for formal analysis and writing the original draft and visualisation. XS additionally verified the underlying data and analysis. MCB, CA, ASFK, XS, MJA, DMH, RMW, SWM, JLM, CL, JvD, MG, CR, DJP, AC, KLE, HCW, and AMM reviewed versions of the manuscript. MCB, HCW, and AMM were responsible for manuscript editing and review. AMM and HCW were responsible for supervision, project administration, resources, and funding acquisition. MCB, CA, ASFK, XS, MJA, DMH, RMW, SWM, JLM, CL, JvD, MG, CR, DJP, AC, KLE, HCW, and AMM were not precluded from accessing data in the study and they accept responsibility to submit for publication. All authors read and approved the final version of the manuscript.

Data sharing statement
Data used in the preparation of this article were obtained from the Generation Scotland (GS) cohort (https://www.ed.ac.uk/generation-scotland) and the Avon Longitudinal Study of Parents and Children (ALSPAC) cohort (http://www.bristol.ac.uk/alspac/).
Qualified researchers can request access to GS data through a research proposal, accessed at the following link: https://www.ed.ac.uk/generation-scotland/forresearchers/access and to ALSPAC, accessed at the following link: http://www.bristol.ac.uk/alspac/research ers/access/.
Scripts for the analyses in this project can be found in the Supplementary RMarkdown file. We are extremely grateful to all the families who took part in this study, the midwives for their help in recruiting them, and the whole ALSPAC team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists and nurses. Part of this data was collected using REDCap, see the REDCap website for details https://projectredcap.org/resources/citations/). Intelligent Systems, Medical University of Vienna, Vienna, Austria